Abstract

A self-stabilizing simulation of a single-writer multi-reader atomic register is presented. The simulation
works in asynchronous message-passing systems, and allows processes to crash, as long as at least a
majority of them remain working. A key element in the simulation is a new combinatorial construction 
of a bounded labeling scheme that can accommodate arbitrary labels, i.e., including those not generated 
by the scheme itself. 
1 Introduction

Distributed systems have become an integral part of virtually all computing systems, especially those of 
large scale. These systems must provide high availability and reliability in the presence of failures, which 
could be either permanent or transient. 

A core abstraction for many distributed algorithms simulates shared memory [3]; this abstraction makes it
possible to take algorithms designed for shared memory and port them to asynchronous message-passing systems,
even in the presence of failures. There has been significant work on creating such simulations, under various
types of permanent failures, as well as on exploiting this abstraction in order to derive algorithms for
message-passing systems. (See a recent survey [?].)

All these works, however, only consider permanent failures, neglecting to incorporate mechanisms for 
handling transient failures. Such failures may result from incorrect initialization of the system, or from 
temporary violations of the assumptions made by the system designer, for example the assumption that 
a corrupted message is always identified by an error detection code. The ability to automatically resume 
normal operation following transient failures, namely to be self-stabilizing [?], is an essential property that
should be integrated into the design and implementation of systems. 

This paper presents the first practically self-stabilizing simulation of shared memory that tolerates 
crashes. Specifically, we propose a single-writer multi-reader (SWMR) atomic register in asynchronous 
message-passing systems where fewer than a majority of the processors may crash. A single-writer multi-reader
register is atomic if each read operation returns the value of the most recent write operation that happened
before it or the value written by a concurrent write, and if, once a read returns a value, every subsequent
read returns the same or a later value.

The simulation is based on reads and writes to a (majority) quorum in a system with a fully connected 
graph topology.¹ A key component of the simulation is a new bounded labeling scheme that needs no
initialization, as well as a method for using it when communication links and processes are started at an 
arbitrary state. 

Overview of our simulation. Attiya, Bar-Noy and Dolev presented the first simulation of a SWMR 
atomic register in a message-passing system, supporting two procedures, read and write, for accessing the 
register. This simple simulation is based on a quorum approach: In a write operation, the writer makes sure 
that a quorum of processors (consisting of a majority of the processors, in its simplest variant) store its latest 
value. In a read operation, a reader contacts a quorum of processors, and obtains the latest values they store 
for the register; in order to ensure that other readers do not miss this value, the reader also makes sure that a 
quorum stores its return value. 

A key ingredient of this scheme is the ability to distinguish between older and newer values of the 
register; this is achieved by attaching a sequence number to each register value. In its simplest form, the 
sequence number is an unbounded integer, which is increased whenever the writer generates a new value. 
This solution could be appropriate for an initialized system, which starts in a consistent configuration, in
which all sequence numbers are zero, and are only incremented by the writer or forwarded as is by readers. 
In this manner, a 64-bit sequence number will not wrap around for a number of writes that is practically 
infinite, certainly longer than the life-span of any reasonable system. 



¹ Note that standard end-to-end schemes can be used to implement the quorum operation in the case of a general
communication graph.



However, when there are transient failures in the system, as is the case in the context of self-stabilization, 
the simulation starts at an uninitialized state, where sequence numbers are not necessarily all zero. It is 
possible that, due to a transient failure, the sequence numbers might hold the maximal value when the 
simulation starts running, and thus, will wrap around very quickly. 

Our solution is to partition the execution of the simulation into epochs, namely periods during which 
the sequence numbers are supposed not to wrap around. Whenever a "corrupted" sequence number is 
discovered, a new epoch is started, overriding all previous epochs; this repeats until no more corrupted 
sequence numbers are hidden in the system, and the system stabilizes. Ideally, in this steady state, after the 
system stabilizes, it will remain in the same epoch (at least until all sequence numbers wrap around, which 
is unlikely to happen). 

This raises, naturally, the question of how to label epochs. The natural idea, of using integers, is bound 
to run into the same problems as for the sequence numbers. Instead, we capitalize on another idea from [3],
of using a bounded labeling scheme for the epochs. A bounded labeling scheme [?] provides a function
for generating labels (in a bounded domain), and guarantees that two labels can be compared to determine 
the largest among them. 

Existing labeling schemes assume that initially, labels have specific initial values, and that new labels 
are introduced only by means of the label generation function. However, transient failures, of the kind the 
self-stabilizing simulation must withstand, can create incomparable labels, so it is impossible to tell which 
is the largest among them or to pick a new label that is bigger than all of them. 

To address this difficulty, we present a constructive bounded labeling scheme that allows defining a label
larger than every label in a given set, provided that the size of the set is bounded. We assume links have bounded capacity, and
hence the number of epochs initially hidden in the system is bounded. 

The writer tracks the set of epochs it has seen recently; whenever the writer discovers that its current 
epoch is not the largest, or is incomparable to some existing epoch, the writer generates a new epoch that is 
larger than all the epochs it has. The number of bits required to represent a label depends on $m$, the maximal
size of the set, and it is in $O(m \log m)$. We ensure that the size of the set is proportional to the total capacity
of the communication links, namely, $O(cn^2)$, where $c$ is the bound on the capacity of each link, and hence,
each epoch requires $O(cn^2(\log n + \log c))$ bits.

It is possible to reduce this complexity, making c essentially constant, by employing a data-link protocol 
for communication among the processors. 

We show that, after a bounded number of write operations, the results of reads and writes can be totally
ordered in a manner that respects the real-time order of non-overlapping operations, so that the
sequence of operations satisfies the semantics of a SWMR register. This holds until the sequence numbers 
wrap around, as can happen in a realistic version of the unbounded ABD simulation. 

Related work. A self-stabilizing simulation of an atomic single-writer single-reader shared register, on a
message-passing system, was presented in [7]. This simulation does not address SWMR registers. Moreover,
the simulation cannot withstand processor crashes. More recent papers [6, 13] focused on self-stabilizing
simulation of shared registers using weaker shared registers. Self-stabilizing timestamp implementations
using SWMR atomic registers were suggested in [?]. These implementations already assume the existence
of a shared memory, while, in contrast, we simulate a shared SWMR atomic register using message passing.



2 Preliminaries 

A message-passing system consists of $n$ processors, $p_0, p_1, p_2, \ldots, p_{n-1}$, connected by communication links
through which messages are sent and received. We assume that the underlying communication graph is
completely connected, namely, every pair of processors, $p_i$ and $p_j$, have a communication link.

A processor is modeled by a state machine that executes steps. In each step, the processor changes its 
state, and executes a single communication operation, which is either a send message operation or a receive 
message operation. The communication operation changes the state of an attached link, in the natural 
manner. 

The system configuration is a vector of $n$ states, a state for each processor, and $2(n^2 - n)$ sets, each
bounded by a constant message capacity $c$. There is a set $s_{ij}$ (a set rather than a queue, reflecting the
non-FIFO nature of the links) for each directed edge $(i, j)$ from a processor $p_i$ to a processor $p_j$. Note that
in the scope of self-stabilization, where the system copes with an arbitrary starting configuration, there is no
deterministic data-link simulation that uses bounded memory when the capacity of links is unbounded [?].

An execution is a sequence of configurations and steps, $E = (C_1, a_1, C_2, a_2, \ldots)$, such that $C_i$, $i > 1$,
is obtained by applying $a_{i-1}$ to $C_{i-1}$, where $a_{i-1}$ is a step of a single processor, $p_j$, in the system. Thus,
the vectors of states in $C_{i-1}$ and $C_i$, except for the state of $p_j$, are identical. In case the single communication
operation in $a_{i-1}$ is a send operation to $p_k$, then $s_{jk}$ in $C_i$ is the union of $s_{jk}$ in $C_{i-1}$ with the message sent
in $a_{i-1}$. If the obtained union does not respect the message bound $|s_{jk}| \le c$, then an arbitrary message
in the obtained union is deleted. The rest of the message sets are kept unchanged. In case the single
communication operation in $a_{i-1}$ is a receive operation of a (non-null) message $m$, then $m$ (must exist in $s_{kj}$
of $C_{i-1}$ and) is removed from $s_{kj}$; all the rest of the sets are identical in $C_{i-1}$ and $C_i$. A receive operation
by $p_j$ from $p_k$ may result in a null message even when $s_{kj}$ is not empty, thus allowing unbounded delay
for any particular message. Message losses are modeled by allowing spontaneous message removals from
the set. An edge $(i, j)$ is operational if a message sent infinitely often by $p_i$ is received infinitely often by $p_j$.

For the simulation of a single-writer multi-reader (SWMR) atomic register, we assume $p_0$ is the writer
and $p_1, p_2, \ldots, p_{n-1}$ are the readers; $p_0$ has a write procedure (operation) and the readers have a read
procedure (operation). The sub-execution between the step that starts a write procedure and the next step that ends
the write procedure execution defines a write period. Similarly, for a particular read by processor $p_i$, the
sub-execution between the step that starts a read procedure by processor $p_i$ and the next step that ends the
read procedure execution of $p_i$ defines a read period.

SWMR atomic register. A single-writer multi-reader atomic register supplies two operations: read and
write. An invocation of a read or write translates into a sequence of computation steps. A sequence
of invocations of read and write operations generates an execution in which the computation steps
corresponding to different invocations are interleaved. An operation $op_1$ happens before an operation $op_2$ in
this execution, if $op_1$ returns before $op_2$ is invoked. Two operations overlap if neither of them happens
before the other. Each interleaved execution of an atomic register is required to be linearizable [15], that
is, it must be equivalent to an execution in which the operations are executed sequentially, and the order
of non-overlapping operations is preserved. The main difference between a regular register (a register that
satisfies the property that every read returns the value written by the most recent write or by a concurrent
write) and an atomic register is the absence of new/old inversions. Consider two consecutive² reads $r_1$, $r_2$
and two consecutive writes $w_1$, $w_2$ of a regular register such that $r_1$ is concurrent with both $w_1$ and $w_2$ and
$r_2$ is concurrent only with $w_2$. The regularity property allows $r_2$ to return the value written by $w_1$ and $r_1$ to
return the value written by $w_2$. This phenomenon is called the new/old inversion.

² Two operations $op_1$ and $op_2$ are consecutive if $op_1$ is the most recent operation that happens before $op_2$.

An atomic register prevents new/old inversions in all executions.

Formally, an atomic register verifies the following two properties: 

• Regularity property. A read operation returns either the value written by the most recent write
operation that happened before the read or a value written by a concurrent write.

• No new/old inversions. If a read operation $r_1$ reads a value from a concurrent write operation $w_2$, then
no read operation that happens after $r_1$ reads a value from a write operation $w_1$ that happens before
$w_2$.

Practically stabilizing SWMR atomic register. A message-passing system simulates a SWMR atomic register
in a practically stabilizing manner if any infinite execution, starting in an arbitrary configuration, in which
the writer writes infinitely often has a sub-execution with a practically infinite number of write operations,
in which the atomicity requirement holds. A practically infinite execution is an execution of at least $2^k$ steps,
for some large $k$; for example, $k = 64$ is big enough for any practical system.
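To give a feeling for why $2^{64}$ steps can be treated as practically infinite, the following minimal sketch does the arithmetic. The step rate of one step per nanosecond is an assumption chosen for illustration; it does not come from the paper.

```python
# Rough arithmetic only: the step rate used below is an assumed figure.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

def years_to_exhaust(k: int, steps_per_second: float) -> float:
    """Years needed to perform 2**k steps at a constant rate."""
    return (2 ** k) / steps_per_second / SECONDS_PER_YEAR

# One step per nanosecond (assumption): several centuries for k = 64.
years_64 = years_to_exhaust(64, 1e9)
```

Under this assumed rate, exhausting $2^{64}$ steps takes several hundred years, which is the sense in which such an execution outlives any reasonable system.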

3 Overview of the Algorithm 

3.1 The Basic Quorum-Based Simulation 

We describe the basic simulation, which follows the quorum-based approach of [3], and ensures that our
algorithm tolerates (crash) failures of fewer than a majority of the processors. Our simulation assumes the
existence of an underlying stabilizing data-link protocol [11], similar to the ping-pong mechanism used
in [3].

The simulation relies on a set of read and write quorums, each being a majority of processors. The simulation
specifies the write and read procedures, in terms of QuorumRead and QuorumWrite operations.
The QuorumRead procedure sends a request to every processor, for reading a certain local variable of the
processor; the procedure terminates with the obtained values, after receiving answers from processors that
form a quorum. Similarly, the QuorumWrite procedure sends a value to every processor to be written to a
certain local variable of the processor; it terminates when acknowledgments from a quorum are received. If
a processor that is inside QuorumRead or QuorumWrite keeps taking steps, then the procedure terminates
(possibly with arbitrary values). Furthermore, if a processor starts a QuorumRead procedure execution, then
the stabilizing data link [11] ensures that a read of a value returns a value held by the read variable some time
during its period; similarly, a QuorumWrite($v$) procedure execution causes $v$ to be written to the variable
during its period.

Each processor $p_i$ maintains a variable, $\mathit{MaxSeq}_i$, which is meant to hold the "largest" sequence number
the processor has read. $p_i$ maintains in $v_i$ the value that $p_i$ knows for the implemented register (which is
associated with $\mathit{MaxSeq}_i$).

The write procedure of a value $v$ starts with a QuorumRead of the $\mathit{MaxSeq}_i$ variables; upon receiving
answers $l_1, l_2, \ldots$ from a quorum, the writer picks a sequence number $l_m$ that is larger by one than the
maximum of $\mathit{MaxSeq}_0$ and $l_1, l_2, \ldots$; the writer assigns $l_m$ to $\mathit{MaxSeq}_0$ and calls QuorumWrite with the
value $(l_m, v)$. Whenever a quorum member $p_i$ receives a QuorumWrite request $(l, v)$ for which $l$ is larger than
$\mathit{MaxSeq}_i$, $p_i$ assigns $l$ to $\mathit{MaxSeq}_i$ and $v$ to $v_i$.

The read procedure by $p_i$ starts with a QuorumRead of both the $\mathit{MaxSeq}_j$ and the (associated) $v_j$
variables. When $p_i$ receives answers $(l_1, v_1), (l_2, v_2), \ldots$ from a quorum, $p_i$ finds the largest label $l_m$ among
$\mathit{MaxSeq}_i$ and $l_1, l_2, \ldots$, and then calls QuorumWrite with the value $(l_m, v_m)$. This ensures that later read
operations will return this, or a later, value of the register. When QuorumWrite terminates, after a write
quorum acknowledges, $p_i$ assigns $l_m$ to $\mathit{MaxSeq}_i$ and $v_m$ to $v_i$ and returns $v_m$ as the value read from the
register.

Note that the QuorumRead operation, beginning the write procedure of $p_0$, helps to ensure that $\mathit{MaxSeq}_0$
holds the maximal value, as the writer reads the biggest accessible value (directly read by the writer, or
propagated to variables that are later read by the writer) in the system during any write.
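The basic (unbounded, non-stabilizing) quorum simulation described above can be sketched as follows. This is a simplified, single-threaded illustration and not the paper's message-passing protocol: messages, the data link, and asynchrony are replaced by direct calls, a quorum is an arbitrary random majority of the live processors, and the names `Processor`, `QuorumRegister`, and `seed` are hypothetical.

```python
import random
from dataclasses import dataclass

@dataclass
class Processor:
    max_seq: int = 0        # MaxSeq_i
    value: object = None    # v_i
    crashed: bool = False

class QuorumRegister:
    def __init__(self, n, seed=0):
        self.procs = [Processor() for _ in range(n)]   # procs[0] plays the writer p_0
        self.majority = n // 2 + 1
        self.rng = random.Random(seed)                 # assumed: picks "which quorum answers"

    def _quorum(self):
        live = [p for p in self.procs if not p.crashed]
        if len(live) < self.majority:
            raise RuntimeError("fewer than a majority of processors are alive")
        return self.rng.sample(live, self.majority)

    def quorum_read(self):
        return [(p.max_seq, p.value) for p in self._quorum()]

    def quorum_write(self, seq, value):
        for p in self._quorum():
            if seq > p.max_seq:                        # adopt only newer values
                p.max_seq, p.value = seq, value

    def write(self, value):
        writer = self.procs[0]
        answers = self.quorum_read()
        seq = max([writer.max_seq] + [s for s, _ in answers]) + 1
        writer.max_seq, writer.value = seq, value
        self.quorum_write(seq, value)

    def read(self, i):
        reader = self.procs[i]
        answers = self.quorum_read() + [(reader.max_seq, reader.value)]
        seq, value = max(answers, key=lambda a: a[0])
        self.quorum_write(seq, value)                  # write back before returning
        reader.max_seq, reader.value = seq, value
        return value
```

Because any two majorities intersect, a read that follows a completed write always sees the written sequence number in at least one answer, whichever quorums happen to respond. The write-back in `read` is what prevents a later read from returning an older value.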

Let $g(C_i)$ be the number of distinct values greater than $\mathit{MaxSeq}_0$ that exist in some configuration $C_i$.
Since all the processors, except the writer, only copy values and since $p_0$ can only increment the value of
$\mathit{MaxSeq}_0$, it holds for every $i \ge 1$ that

$g(C_i) \ge g(C_{i+1})$.

Furthermore,

$g(C_i) > g(C_{i+1})$,

whenever the writer discovers (when executing step $a_i$) a value greater than $\mathit{MaxSeq}_0$. Roughly speaking,
the faster the writer discovers these values, the earlier the system stabilizes. If the writer does not discover 
such a value, then the (accessible) portion of the system in which its values are repeatedly written, performs 
reads and writes correctly. 

3.2 Epochs 

As described in the introduction, it is possible that the sequence numbers wrap around faster than planned, 
due to "corrupted" initial values. When the writer discovers that this has happened, it opens a new epoch, 
thereby invalidating all sequence numbers from previous epochs. 

Epochs are denoted with labels from a bounded domain, using a bounded labeling scheme. Such a 
scheme provides a function to compute a new label, which is "larger" than a given set of labels. 

Definition 1 A labeling scheme over a bounded domain $\mathcal{L}$ provides an antisymmetric comparison predicate
$\prec_b$ on $\mathcal{L}$ and a function $\mathrm{Next}_b(S)$ that returns a label in $\mathcal{L}$, given some subset $S \subseteq \mathcal{L}$ of size at most $m$. It
is guaranteed that for every $L \in S$, $L \prec_b \mathrm{Next}_b(S)$.

Note that the labeling scheme [12], used in the original atomic memory simulation [3], does not cope
with transient failures. The next section describes a construction of a bounded labeling scheme that can cope 
with badly initialized labels, namely, that does not assume that labels were only generated by using Next. 

Using this scheme, it is guaranteed that if the writer eventually learns about all the epochs in the system, 
it will generate an epoch greater than all of them. After this point, any read that starts after a write of v is 
completed (written to a quorum) returns v (or a later value), since the writer will use increasing sequence 
numbers. 



The eventual convergence of the labeling scheme depends on invoking $\mathrm{Next}_b$ with a parameter $S$ that
is a superset of the epochs that are in the system. Estimating this set is another challenge for the simulation. 

We explain the intuition of this part of the simulation through the following two-player guessing game, 
between a finder, representing the writer, and a hider, representing an adversary controlling the system. 

- The hider maintains a set of labels $\mathcal{H}$, whose size is at most $m$ (a parameter that will be chosen later).

- The finder does not know $\mathcal{H}$, but it would like to generate a label greater than all labels in $\mathcal{H}$.

- The finder generates a label $L$, and if $\mathcal{H}$ contains a label $L'$ such that it does not hold that $L' \prec_b L$,
then the hider exposes $L'$ to the finder.

- In this case, the hider may choose to add $L$ to $\mathcal{H}$; however, it must ensure that the size of $\mathcal{H}$ does not
exceed $m$ (by removing another label). (The finder is unaware of the hider's decision.)

- If the hider does not expose a new label $L'$ from $\mathcal{H}$, the finder wins this iteration and continues to use
$L$.

The finder uses the following strategy. It maintains a FIFO queue of $2m$ labels, meant to track the most
recent labels. The queue starts with arbitrary values, and during the course of the game, it holds up to m 
recent labels produced by the finder, that turned out to be overruled by existing labels (provided by the 
hider). The queue also holds up to m labels that were revealed to overrule these labels. 

Before the finder chooses a new label, it enqueues its previously chosen label and the label received 
from the hider in response. Enqueuing a label that appears in the queue pushes the label to the head of the 
queue; if the bound on the size of the queue is reached, then the oldest label in the queue is dequeued. 

The finder chooses the next label by applying $\mathrm{Next}_b$, using as parameter the $2m$ labels in the queue.
Intuitively, the queue eventually contains a superset of $\mathcal{H}$, and the finder generates a label greater than all
the current labels of the hider.
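The finder's bookkeeping can be sketched as a bounded queue in which re-enqueuing a known label refreshes its position. This is only an illustration of the queue discipline described above; the class name `LabelQueue`, the helper `finder_round`, and the callback `next_label` (standing in for the Next function of the labeling scheme) are hypothetical names, and labels are assumed to be hashable.

```python
from collections import OrderedDict

class LabelQueue:
    """Bounded queue of labels; enqueuing a label already present moves it to the head."""

    def __init__(self, capacity):
        self.capacity = capacity          # 2m in the text
        self._items = OrderedDict()       # iteration order: oldest first

    def enqueue(self, label):
        if label in self._items:
            self._items.move_to_end(label)    # refresh: now the most recent label
            return
        if len(self._items) >= self.capacity:
            self._items.popitem(last=False)   # dequeue the oldest label
        self._items[label] = None

    def labels(self):
        return list(self._items)              # oldest first

def finder_round(queue, previous_label, exposed_label, next_label):
    """One finder move: record the overruled label and its evidence, then pick anew."""
    queue.enqueue(previous_label)
    queue.enqueue(exposed_label)
    return next_label(queue.labels())
```

For example, with capacity 3, enqueuing `a`, `b`, `c`, then `a` again, then `d` leaves `c`, `a`, `d` in the queue: the repeated `a` was refreshed, so `b` became the oldest label and was dropped.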

Lemma 1 All the labels of the hider are smaller than one of the first m+1 labels chosen by the finder. 

Sketch of proof: A simple induction shows that when the finder chooses the $i$th new label, $i > 0$, the $2i$
items in the front of the queue consist of the first $i$ labels generated by the finder, and the first $i$ labels
revealed by the hider.

Note that a response cannot expose a label that has been introduced or previously exposed in the game
since the finder always chooses a label greater than all labels in the queue, in particular these $2i$ labels. Thus,
if the finder does not win when introducing the $m$th label, all the $m$ labels that the hider had when the game
started were exposed and therefore, stored in the queue of the finder together with all the recent $m$ labels
introduced by the finder, before the $(m+1)$st label is chosen. Therefore, the $(m+1)$st label is larger than every
label held by the hider, and the finder wins. □

3.3 Timestamps 

The complete simulation tags each value written with a timestamp: a pair $(l, i)$, where $l$ is an epoch chosen
from a bounded domain $\mathcal{L}$ and $i$ is a sequence number (an integer smaller than some bound $r$).



4 A Bounded Labeling Scheme with Uninitialized Values 

Let $k > 1$ be an integer, and let $K = k^2 + 1$. We consider the set $X = \{1, 2, \ldots, K\}$ and let $\mathcal{L}$ (the set of
labels) be the set of all ordered pairs $(s, A)$ where $s \in X$ is called in the sequel the sting, and $A \subseteq X$
has size $k$ and is called in the sequel the antistings. It follows that $|\mathcal{L}| = \binom{K}{k} K = k^{(1+o(1))k}$.

The comparison operator $\prec_b$ among the bounded labels is defined to be:

$(s_j, A_j) \prec_b (s_i, A_i) \equiv (s_j \in A_i) \wedge (s_i \notin A_j)$

Note that this operator is antisymmetric by definition, yet may not be defined for every pair $(s_i, A_i)$ and
$(s_j, A_j)$ in $\mathcal{L}$ (e.g., $s_j \in A_i$ and $s_i \in A_j$).
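The comparison operator translates directly into code. The sketch below represents a label as a pair of an integer sting and a frozen set of antistings; this representation and the function name `precedes_b` are illustrative choices, not part of the paper. The three sample labels use $k = 2$, so $X = \{1, \ldots, 5\}$.

```python
from typing import FrozenSet, Tuple

Label = Tuple[int, FrozenSet[int]]   # (sting, antistings)

def precedes_b(a: Label, b: Label) -> bool:
    """(s_j, A_j) <_b (s_i, A_i) iff s_j is in A_i and s_i is not in A_j."""
    s_j, A_j = a
    s_i, A_i = b
    return s_j in A_i and s_i not in A_j

# Sample labels for k = 2.
a = (1, frozenset({2, 3}))
b = (4, frozenset({1, 5}))
c = (2, frozenset({1, 3}))

a_before_b = precedes_b(a, b)                           # 1 in {1,5} and 4 not in {2,3}
a_c_incomparable = not precedes_b(a, c) and not precedes_b(c, a)
```

Here `a` precedes `b`, while `a` and `c` are incomparable because each one's sting lies in the other's antistings. Incomparable pairs of this kind are exactly what arbitrary (uninitialized) labels can produce.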

We define now a function to compute, given a subset $S$ of at most $k$ labels of $\mathcal{L}$, a new label which is
greater (with respect to $\prec_b$) than every label of $S$. This function, called $\mathrm{Next}_b$ (see Figure 1), is as follows.
Given a subset of $k$ labels $(s_1, A_1), (s_2, A_2), \ldots, (s_k, A_k)$, we construct a label $(s_i, A_i)$ which satisfies:

- $s_i$ is an element of $X$ that is not in the union $A_1 \cup A_2 \cup \ldots \cup A_k$ (as the size of each $A_j$ is $k$, the size
of the union is at most $k^2$, and since $X$ is of size $k^2 + 1$ such an $s_i$ always exists).

- $A_i$ is a subset of size $k$ of $X$ containing all values $(s_1, s_2, \ldots, s_k)$ (if they are not pairwise distinct,
add arbitrary elements of $X$ to get a set of size exactly $k$).
Next_b
  input: $S = (s_1, A_1), (s_2, A_2), \ldots, (s_k, A_k)$: set of labels
  output: $(s, A)$: label
  function: for any $\emptyset \ne S \subseteq X$, pick($S$) returns an arbitrary (later defined for particular cases) element of $S$
  1: $A := \{s_1\} \cup \{s_2\} \cup \ldots \cup \{s_k\}$
  2: while $|A| \ne k$
  3:     $A := A \cup \{\mathrm{pick}(X \setminus A)\}$
  4: $s := \mathrm{pick}(X \setminus (A \cup A_1 \cup A_2 \cup \ldots \cup A_k))$
  5: return $(s, A)$

Next_e
  input: $S$: set of $k$ timestamps
  output: $(l, i)$: timestamp
  1: if $\exists (l_0, j_0) \in S$ such that $\forall (l, j) \in S$, $(l, j) \ne (l_0, j_0)$: $(l, j) \prec_e (l_0, j_0) \wedge j_0 < r$
  2: then return $(l_0, j_0 + 1)$
  3: else return $(\mathrm{Next}_b(\tilde{S}), 0)$

Figure 1: $\mathrm{Next}_b$ and $\mathrm{Next}_e$. $\tilde{S}$ denotes the set of labels appearing in $S$.

Lemma 2 Given a subset $S$ of $k$ labels from $\mathcal{L}$, $(s_i, A_i) = \mathrm{Next}_b(S)$ satisfies:

$\forall (s_j, A_j) \in S,\ (s_j, A_j) \prec_b (s_i, A_i)$

Proof: Let $(s_j, A_j)$ be an element of $S$. By construction, $s_j \in A_i$ and $s_i \notin A_j$, and the result follows from
the definition of $\prec_b$. □
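The construction behind Lemma 2 can be sketched in a few lines. The sketch follows the prose description of the construction (the new sting is chosen outside the union of the given antisting sets, and the new antistings contain all given stings); the deterministic choice of the smallest available element stands in for the arbitrary `pick`, and the function names are illustrative.

```python
def precedes_b(a, b):
    """(s_j, A_j) <_b (s_i, A_i) iff s_j is in A_i and s_i is not in A_j."""
    (s_j, A_j), (s_i, A_i) = a, b
    return s_j in A_i and s_i not in A_j

def next_b(labels, k):
    """Return a label greater (w.r.t. <_b) than each of at most k given labels."""
    labels = list(labels)
    if len(labels) > k:
        raise ValueError("next_b accepts at most k labels")
    X = range(1, k * k + 2)                  # X = {1, ..., k^2 + 1}
    A = {s for s, _ in labels}               # new antistings contain every given sting
    for x in X:                              # pad with arbitrary elements up to size k
        if len(A) == k:
            break
        A.add(x)
    covered = set().union(*(A_j for _, A_j in labels))
    s = next(x for x in X if x not in covered)   # exists: |covered| <= k^2 < |X|
    return (s, frozenset(A))

# Example with k = 2: two labels that are incomparable with each other.
old = [(1, frozenset({2, 3})), (2, frozenset({1, 3}))]
new = next_b(old, 2)
```

In this example the two input labels are incomparable, yet the returned label dominates both. This is the property that lets the writer recover from arbitrary initial labels.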

Note also that it is simple to compute $A_i$ and $s_i$ given a set $S$ with $k$ labels, and can be done in time
linear in the total length of the labels given, i.e., in $O(k^2)$ time. Since the number of labels $|\mathcal{L}|$ is $k^{(1+o(1))k}$,
we have that $k$ is $(1+o(1)) \frac{\log |\mathcal{L}|}{\log \log |\mathcal{L}|}$.
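The relation between $k$ and the size of the label domain follows by taking logarithms twice. The short derivation below is a sketch under the stated estimate $|\mathcal{L}| = k^{(1+o(1))k}$, with all $o(1)$ terms taken as $k \to \infty$.

```latex
\begin{align*}
\log |\mathcal{L}| &= (1+o(1))\, k \log k, \\
\log\log |\mathcal{L}| &= \log k + \log\log k + \log(1+o(1)) = (1+o(1)) \log k, \\
k &= \frac{\log |\mathcal{L}|}{(1+o(1)) \log k}
   = (1+o(1))\, \frac{\log |\mathcal{L}|}{\log\log |\mathcal{L}|}.
\end{align*}
```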



Timestamps. A timestamp is a pair $(l, i)$ where $l$ is a bounded epoch, and $i$ is an integer (sequence number),
ranging from 0 to a fixed bound $r > 1$.

The $\mathrm{Next}_e$ operator computes a new timestamp from a set of timestamps, and is described in Figure 1. Note that in
line 3 of the code we use $\tilde{S}$ for the set of labels (with sequence numbers removed) that appear in $S$. The
comparison operator $\prec_e$ for timestamps is:

$(x, i) \prec_e (y, j) \equiv x \prec_b y \vee (x = y \wedge i < j)$

In the sequel, we use $\prec_b$ to compare timestamps, according to their epochs only.
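The timestamp order and the next-timestamp rule can be sketched on top of the label operations. This is an illustrative reading of the rule: if some timestamp dominates all others and its sequence number is below the bound $r$, its sequence number is incremented; otherwise a fresh epoch is generated with sequence number 0. The helpers `precedes_b` and `next_b` are repeated in compact form so the block is self-contained, and all function names are hypothetical.

```python
def precedes_b(a, b):
    (s_j, A_j), (s_i, A_i) = a, b
    return s_j in A_i and s_i not in A_j

def next_b(labels, k):
    X = range(1, k * k + 2)
    A = {s for s, _ in labels}
    for x in X:                                  # pad the antistings up to size k
        if len(A) == k:
            break
        A.add(x)
    covered = set().union(*(A_j for _, A_j in labels))
    return (next(x for x in X if x not in covered), frozenset(A))

def precedes_e(t, u):
    """(x, i) <_e (y, j) iff x <_b y, or x = y and i < j."""
    (x, i), (y, j) = t, u
    return precedes_b(x, y) or (x == y and i < j)

def next_e(timestamps, k, r):
    ts = list(timestamps)
    for cand in ts:
        epoch, seq = cand
        if seq < r and all(t == cand or precedes_e(t, cand) for t in ts):
            return (epoch, seq + 1)              # same epoch, next sequence number
    epochs = list({epoch for epoch, _ in ts})    # sequence numbers removed
    return (next_b(epochs, k), 0)                # fresh epoch, sequence number 0

L1 = (1, frozenset({2, 3}))
L2 = (4, frozenset({1, 5}))                      # L1 <_b L2
incremented = next_e([(L1, 7), (L2, 3)], k=2, r=10)
fresh = next_e([(L1, 7), (L2, 3)], k=2, r=3)     # sequence bound reached
```

In the first call the dominating timestamp `(L2, 3)` is simply advanced to `(L2, 4)`. In the second call the bound has been reached, so a new epoch greater than both `L1` and `L2` is produced.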



5 Putting the Pieces Together 



Each processor $p_i$ holds, in $\mathit{MaxTS}_i$, two fields $(ml_i, cl_i)$, where $ml_i$ is the timestamp associated with the
last write of a value to the variable $v_i$ and $cl_i$ is a canceling timestamp, possibly empty ($\bot$), which is not
smaller than $\mathit{MaxTS}_i.ml$ in the $\prec_b$ order. The canceling field is used to let the writer (the finder in the game)
know of an evidence. A timestamp $(l, i)$ is an evidence for timestamp $(l', j)$ if and only if $l \not\prec_b l'$. In this
case the writer will further change the current epoch.

The pseudo code for the read and write procedures appears in Figure 2. Note that in lines 2 and 9 of
the write procedure, a label is enqueued if and only if it is not equal to the value stored in $\mathit{MaxTS}_0$. Note
further that $\mathrm{Next}_e$ in line 4 of the writer first tries to increment the sequence number of the label stored
in $\mathit{MaxTS}_0$, and if the sequence number already equals the upper bound $r$ then $p_0$ enqueues the value
of $\mathit{MaxTS}_0$ and uses the updated epochs queue to choose a new value for $\mathit{MaxTS}_0$, which is a new epoch
$\mathrm{Next}_b(\mathit{epochs})$ and sequence number 0.



write_0($v$)
  1: $l_1, l_2, \ldots$ := QuorumRead
  2: if $l_i \ne \mathit{MaxTS}_0$ then enqueue(epochs, $l_i$)
  3: if $\forall i,\ l_i \prec_e \mathit{MaxTS}_0$ then
  4:     $\mathit{MaxTS}_0 := \mathrm{Next}_e(\mathit{MaxTS}_0, \mathit{epochs})$
  5: else
  6:     enqueue(epochs, $\mathit{MaxTS}_0$)
  7:     $\mathit{MaxTS}_0 := (\mathrm{Next}_b(\mathit{epochs}), 0)$
  8: QuorumWrite($(\mathit{MaxTS}_0, v)$)
  Upon a request of QuorumWrite($l, v$)
  9: if $l \ne \mathit{MaxTS}_0$ then enqueue(epochs, $l$)

read
  1: $((ml_1, cl_1), v_1), ((ml_2, cl_2), v_2), \ldots$ := QuorumRead
  2: if $\exists m$ such that $cl_m = \bot$ and
  3:     ($\forall i \ne m$: $ml_i \prec_e ml_m$ and $cl_i \prec_e ml_m$) then
  4:     QuorumWrite($ml_m, v_m$)
  5:     return($v_m$)
  6: else return($\bot$)
  Upon a request of QuorumWrite($l, v$)
  7: if $\mathit{MaxTS}_i.ml \prec_e l$ and $\mathit{MaxTS}_i.cl \prec_e l$ then
  8:     $\mathit{MaxTS}_i := l$
  9:     $v_i := v$
  10: else if $l \not\prec_b \mathit{MaxTS}_i.ml$ then $\mathit{MaxTS}_i.cl := l$

Figure 2: write($v$) and read.

The write procedure of a value $v$ starts with a QuorumRead of the $\mathit{MaxTS}_i$ variables, and upon
receiving answers $l_1, l_2, \ldots$ from a quorum, the writer $p_0$ enqueues to the epochs queue the epochs of the
received $ml$ and non-$\bot$ $cl$ values, which are not equal to $\mathit{MaxTS}_0$ (lines 1-2). The writer then computes
$\mathit{MaxTS}_0$ to be the $\mathrm{Next}_e$ timestamp, namely if the epoch of $\mathit{MaxTS}_0$ is the largest in the epochs queue
and the sequence number of $\mathit{MaxTS}_0$ is less than $r$, then $p_0$ increments the sequence number of $\mathit{MaxTS}_0$ by
one, leaving the epoch of $\mathit{MaxTS}_0$ unchanged (lines 3-4). Otherwise, it is necessary to change the epoch:
$p_0$ enqueues $\mathit{MaxTS}_0$ to the epochs queue and applies $\mathrm{Next}_b$ to obtain an epoch greater than all the ones
in the epochs queue; it assigns to $\mathit{MaxTS}_0$ the timestamp made of this epoch and a zero sequence number
(lines 6-7). Finally, $p_0$ executes the QuorumWrite procedure with $(\mathit{MaxTS}_0, v)$ (line 8).

Whenever the writer $p_0$ receives (as a quorum member) a QuorumWrite request containing an epoch
that is not equal to $\mathit{MaxTS}_0$, $p_0$ enqueues the received label in the epochs queue (line 9).

The read procedure executed by a reader $p_i$ starts with a QuorumRead of the $\mathit{MaxTS}_j$ and the (associated)
$v_j$ variables (line 1). When $p_i$ receives answers $((ml_1, cl_1), v_1), ((ml_2, cl_2), v_2), \ldots$ from a quorum,
$p_i$ tries to find a maximal timestamp $ml_m$ according to the $\prec_e$ operator from among $ml_i$, $cl_i$, $ml_1$, $cl_1$,
$ml_2$, $cl_2, \ldots$. If $p_i$ finds such a maximal timestamp $ml_m$, then $p_i$ executes the QuorumWrite procedure
with $(ml_m, v_m)$. Once the QuorumWrite terminates (the members of a quorum acknowledged), $p_i$ assigns
$\mathit{MaxTS}_i := (ml_m, \bot)$ and $v_i := v_m$, and returns $v_m$ as the value read from the register (lines 2-5). Otherwise,
in case no such maximal value $ml_m$ exists, the read is aborted (line 6).

When a quorum member $p_i$ receives a QuorumWrite request $(l, v)$, it checks whether both $\mathit{MaxTS}_i.ml \prec_b l$
and $\mathit{MaxTS}_i.cl \prec_b l$. If this is the case, then $p_i$ assigns $\mathit{MaxTS}_i := (l, \bot)$ and $v_i := v$ (lines 7-9). Otherwise,
$p_i$ checks whether $l \not\prec_b \mathit{MaxTS}_i.ml$ and if so assigns $\mathit{MaxTS}_i.cl := l$ (line 10).

5.1 Outline of Correctness Proof 

The correctness of the simulation is implied by the game and our previous observations, which we can now 
summarize, recapping the arguments explained in the description of the individual components.

In the simulation, the finder/writer may introduce new epochs even when the hider does not introduce 
an evidence. We consider a timestamp $(l, i)$ to be an evidence for timestamp $(l', j)$ if and only if $l \not\prec_b l'$.
Using large enough bound r on the sequence number (e.g., a 64-bit number), we ensure that either there is 
a practically infinite execution in which the finder/writer introduces new timestamps with no epoch change, 
and therefore with growing sequence numbers, and well-defined timestamp ordering, or a new epoch is 
frequently introduced due to the exposure of hidden unknown epochs. The last case follows the winning 
strategy described for the game. 

The sequence numbers allow the writer to introduce many (practically infinite) timestamps without 
storing all of them, as their epoch is identical. The sequence numbers are a simple extension of the bounded 
epochs just as a least significant digit of a counter; allowing the queues to be proportional to the bounded 
number of the labels in the system. Thus, either the writer introduces an epoch greater than any one in 
the system, and hence will use this epoch to essentially implement a register for a practically unbounded 
period, or the readers never introduce some existing bigger epoch letting the writer increment the sequence 
number infinitely often. Note that if the game continues, while the finder is aware of (a superset including) 
all existing epochs, and introduces a greater epoch, there is a practically infinite execution before a new 
epoch is introduced. 

In the scope of simulating a SWMR atomic register, following the first write of a timestamp greater 
than any other timestamp in the system, with a sequence number 0, to a majority quorum, any read in a 
practically infinite execution, will return the last timestamp that has been written to a quorum. In particular, 
if a reader finds a timestamp introduced by the writer that is larger than all other timestamps but not yet 
completely written to a majority quorum, the reader assists in completing the write to a majority quorum 
before returning the read value. 



The memory may stop operating while the set of timestamps does not include a timestamp greater than the
rest. That is, read operations may be repeatedly aborted until the writer writes new timestamps. Moreover, 
a slow reader may store a timestamp unknown to the rest (and in particular to the writer) and eventually 
introduce the timestamp to the rest. In the first case the convergence of the system is postponed till the 
writer is aware of a superset of the existing timestamps. In the second case the system operate correctly, 
implementing read and write operations, until the timestamp unknown to the rest is introduced. 

Theorem 1 The algorithm eventually reaches a period in which it simulates a SWMR atomic register, for a 
number of operations that is linear in r. 

Each read or write operation requires O(n) messages. The size of the messages is linear in the size of
a timestamp, namely the sum of the size of the epoch and log r. The size of an epoch is O(m log m), where
m is the size of the epochs queue, namely, O(cn^2), where c is the capacity of a communication link.

Note that the size of the epochs queue, and with it, the size of an epoch, is proportional to the number 
of labels that can be stored in a system configuration. Reducing the link capacity will reduce the number of 
labels that can be "hidden" in the communication links. This can be achieved by using a stabilizing data-link 
protocol [11], in a manner similar to the ping-pong mechanism used in [3].



6 Conclusion 

We have presented a self-stabilizing simulation of a single-writer multi-reader atomic register, in an asyn- 
chronous message-passing system in which at most half the processors may crash. 

Given our simulation, it is possible to realize self-stabilizing replicated state machines [14]. The
self-stabilizing consensus algorithms presented in [8] use SWMR registers, and our simulation makes it possible to
port them to message-passing systems. More generally, our simulation allows the application of any self- 
stabilizing algorithm that is designed using SWMR registers to work in a message-passing system, where at 
most half the processors may crash. 

Our work leaves open many interesting directions for future research. The most interesting one is to find 
a stabilizing simulation, which will operate correctly even after sequence numbers wrap around, without an 
additional convergence period. This seems to mandate a more careful way to track epochs, perhaps by
incorporating a self-stabilizing analogue of the viability construction [3]. Practically, it seems that all existing
epochs will be discovered while an epoch is active for 2^64 sequential writes, and therefore the writer will
always introduce a greater timestamp. In addition, obviously, one may initialize a system as done in [3] and
define the next label used by the writer, using our approach, namely our sequence number together with the
queue data structure and canceling timestamp propagation in an approach similar to [11].

Acknowledgments. We thank Ronen Kat and Eli Gafni for helpful discussions. 
Abstract

A self-stabilizing simulation of a single-writer multi-reader atomic register is presented. The simulation
works in asynchronous message-passing systems, and allows processes to crash, as long as at least a
majority of them remain working. A key element in the simulation is a new combinatorial construction 
of a bounded labeling scheme that can accommodate arbitrary labels, i.e., including those not generated 
by the scheme itself. 
1 Introduction

Distributed systems have become an integral part of virtually all computing systems, especially those of 
large scale. These systems must provide high availability and reliability in the presence of failures, which 
could be either permanent or transient. 

A core abstraction for many distributed algorithms simulates shared memory [3]; this abstraction makes it
possible to take algorithms designed for shared memory and port them to asynchronous message-passing systems,
even in the presence of failures. There has been significant work on creating such simulations, under var- 
ious types of permanent failures, as well as on exploiting this abstraction in order to derive algorithms for 
message-passing systems. (See a recent survey 0.) 

All these works, however, only consider permanent failures, neglecting to incorporate mechanisms for 
handling transient failures. Such failures may result from incorrect initialization of the system, or from 
temporary violations of the assumptions made by the system designer, for example the assumption that 
a corrupted message is always identified by an error detection code. The ability to automatically resume 
normal operation following transient failures, namely to be self-stabilizing [4], is an essential property that 
should be integrated into the design and implementation of systems. 

This paper presents the first practical self-stabilizing simulation of shared memory that tolerates crashes.
Specifically, we propose a single-writer multi-reader (SWMR) atomic register in asynchronous message-
passing systems where less than a majority of processors may crash. A single-writer multi-reader register is
atomic if each read operation returns the value of the most recent write operation that happened before it or the
value written by a concurrent write.

The simulation is based on reads and writes to a (majority) quorum in a system with a fully connected 
graph topology¹. A key component of the simulation is a new bounded labeling scheme that needs no
initialization, as well as a method for using it when communication links and processes are started at an 
arbitrary state. 

Overview of our simulation. Attiya, Bar-Noy and Dolev [3] presented the first simulation of a SWMR 
atomic register in a message-passing system, supporting two procedures, read and write, for accessing the 
register. This simple simulation is based on a quorum approach: In a write operation, the writer makes sure 
that a quorum of processors (consisting of a majority of the processors, in its simplest variant) store its latest 
value. In a read operation, a reader contacts a quorum of processors, and obtains the latest values they store 
for the register; in order to ensure that other readers do not miss this value, the reader also makes sure that a 
quorum stores its return value. 

A key ingredient of this scheme is the ability to distinguish between older and newer values of the 
register; this is achieved by attaching a sequence number to each register value. In its simplest form, the 
sequence number is an unbounded integer, which is increased whenever the writer generates a new value. 
This solution could be appropriate for an initialized system, which starts in a consistent configuration, in
which all sequence numbers are zero, and are only incremented by the writer or forwarded as is by readers. 
In this manner, a 64-bit sequence number will not wrap around for a number of writes that is practically 
infinite, certainly longer than the life-span of any reasonable system. 



1 Note that standard end-to-end schemes can be used to implement the quorum operation in the case of a general
communication graph.



However, when there are transient failures in the system, as is the case in the context of self-stabilization, 
the simulation starts at an uninitialized state, where sequence numbers are not necessarily all zero. It is 
possible that, due to a transient failure, the sequence numbers might hold the maximal value when the 
simulation starts running, and thus, will wrap around very quickly. 

Our solution is to partition the execution of the simulation into epochs, namely periods during which 
the sequence numbers are supposed not to wrap around. Whenever a "corrupted" sequence number is 
discovered, a new epoch is started, overriding all previous epochs; this repeats until no more corrupted 
sequence numbers are hidden in the system, and the system stabilizes. Ideally, in this steady state, after the 
system stabilizes, it will remain in the same epoch (at least until all sequence numbers wrap around, which 
is unlikely to happen). 

This raises, naturally, the question of how to label epochs. The natural idea, of using integers, is bound 
to run into the same problems as for the sequence numbers. Instead, we capitalize on another idea from [3],
of using a bounded labeling scheme for the epochs. A bounded labeling scheme (HOH provides a function 
for generating labels (in a bounded domain), and guarantees that two labels can be compared to determine 
the largest among them. 

Existing labeling schemes assume that initially, labels have specific initial values, and that new labels 
are introduced only by means of the label generation function. However, transient failures, of the kind the 
self-stabilizing simulation must withstand, can create incomparable labels, so it is impossible to tell which 
is the largest among them or to pick a new label that is bigger than all of them. 

To address this difficulty, we present a constructive bounded labeling scheme that allows to define a label 
larger than any set of labels, provided that its size is bounded. We assume links have bounded capacity, and 
hence the number of epochs initially hidden in the system is bounded. 

The writer tracks the set of epochs it has seen recently; whenever the writer discovers that its current 
epoch is not the largest, or is incomparable to some existing epoch, the writer generates a new epoch 1 that is 
larger than all the epochs it has. The number of bits required to represent a label depends on m, the maximal
size of the set, and it is in O(m log m). We ensure that the size of the set is proportional to the total capacity
of the communication links, namely, O(cn^2), where c is the bound on the capacity of each link, and hence,
each epoch requires O(cn^2 (log n + log c)) bits.
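
As a sanity check, the bound on the epoch size follows by substituting the queue size m = O(cn^2) into the label size O(m log m). The step below is a routine calculation added for illustration, not part of the original analysis.

```latex
m \log m \;=\; O\!\left(cn^2 \log(cn^2)\right)
         \;=\; O\!\left(cn^2\,(2\log n + \log c)\right)
         \;=\; O\!\left(cn^2\,(\log n + \log c)\right).
```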

It is possible to reduce this complexity, making c essentially constant, by employing a data-link protocol 
for communication among the processors. 

We show that, after a bounded number of write operations, the results of reads and writes can be totally
ordered in a manner that respects the real-time order of non-overlapping operations, so that the
sequence of operations satisfies the semantics of a SWMR register. This holds until the sequence numbers 
wrap around, as can happen in a realistic version of the unbounded ABD simulation. 

Related work. Self-stabilizing simulation of atomic single-writer single-reader shared registers, on a
message-passing system, was presented in [6]. This simulation does not address SWMR registers. Moreover,
the simulation cannot withstand processor crashes. More recent BUI] papers focused on self-stabilizing 
simulation of shared registers using weaker shared registers. Self-stabilizing timestamps implementations 
using SWMR atomic registers were suggested in HE]]. These implementations already assume the existence 
of a shared memory, while, in contrast, we simulate a shared SWMR atomic register using message passing. 



2 Preliminaries 

A message-passing system consists of n processors, p_0, p_1, p_2, ..., p_{n-1}, connected by communication links
through which messages are sent and received. We assume that the underlying communication graph is
completely connected, namely, every pair of processors, p_i and p_j, have a communication link.

A processor is modeled by a state machine that executes steps. In each step, the processor changes its 
state, and executes a single communication operation, which is either a send message operation or a receive 
message operation. The communication operation changes the state of an attached link, in the natural 
manner. 

The system configuration is a vector of n states, a state for each processor, and 2(n^2 - n) sets, each
bounded by a constant message capacity c. There is a set s_{ij} (a set rather than a queue, reflecting the
non-FIFO nature of the links) for each directed edge (i, j) from a processor p_i to a processor p_j. Note that
in the scope of self-stabilization, where the system copes with an arbitrary starting configuration, there is
no deterministic data-link simulation that uses bounded memory when the capacity of links is unbounded [6].

An execution is a sequence of configurations and steps, E = (C_1, a_1, C_2, a_2, ...) such that C_i, i > 1,
is obtained by applying a_{i-1} to C_{i-1}, where a_{i-1} is a step of a single processor, p_j, in the system. Thus,
the vectors of states, except the state of p_j, in C_{i-1} and C_i are identical. In case the single communication
operation in a_{i-1} is a send operation to p_k, then s_{jk} in C_i is the union of s_{jk} in C_{i-1} with the message sent
in a_{i-1}. If the obtained union does not respect the message bound |s_{jk}| ≤ c, then an arbitrary message
in the obtained union is deleted. The rest of the message sets are kept unchanged. In case the single
communication operation in a_{i-1} is a receive operation of a (non-null) message m, then m (must exist in s_{kj}
of C_{i-1} and) is removed from s_{kj}; all the rest of the sets are identical in C_{i-1} and C_i. A receive operation
by p_j from p_k may result in a null message even when s_{kj} is not empty, thus allowing unbounded delay
for any particular message. Message losses are modeled by allowing spontaneous message removals from
the set. An edge (i, j) is operational if a message sent infinitely often by p_i is received infinitely often by p_j.
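
The link model can be illustrated with a small sketch. The Python class below is only an illustration under stated assumptions: the names `Link`, `send`, `receive`, and `lose`, the use of a list as a multiset, and the 50% chance of a null receive are hypothetical choices made here, not part of the model.

```python
import random


class Link:
    """Sketch of one directed link s_jk: bounded, non-FIFO, and lossy."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity            # the constant bound c
        self.messages = []                  # order carries no meaning
        self.rng = rng or random.Random(0)  # assumption: seeded for repeatability

    def _drop_arbitrary(self):
        # Remove and return an arbitrarily chosen message.
        return self.messages.pop(self.rng.randrange(len(self.messages)))

    def send(self, msg):
        self.messages.append(msg)
        if len(self.messages) > self.capacity:
            # Bound exceeded: an arbitrary message of the union is deleted.
            self._drop_arbitrary()

    def receive(self):
        # A receive may yield a null message (None) even if the set is non-empty.
        if not self.messages or self.rng.random() < 0.5:
            return None
        return self._drop_arbitrary()

    def lose(self):
        # Spontaneous message removal models message loss.
        if self.messages:
            self._drop_arbitrary()
```

Because delivery picks an arbitrary element, nothing in this sketch preserves sending order, which is the reason the configuration uses sets rather than queues.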

For the simulation of a single-writer multi-reader (SWMR) atomic register, we assume p_0 is the writer
and p_1, p_2, ..., p_{n-1} are the readers; p_0 has a write procedure/operation and the readers have a read
procedure/operation. The sub-execution between the step that starts a write procedure and the next step that ends
the write procedure execution defines a write period. Similarly, for a particular read by processor p_i, the
sub-execution between the step that starts a read procedure by processor p_i and the next step that ends the
read procedure execution of p_i defines a read period.

SWMR atomic register. A single-writer multi-reader atomic register supplies two operations: read
and write. An invocation of a read or write translates into a sequence of computation steps. A sequence
of invocations of read and write operations generates an execution in which the computation steps
corresponding to different invocations are interleaved. An operation op_1 happens before an operation op_2 in this
execution, if op_1 returns before op_2 is invoked. Two operations overlap if neither of them happens before
the other. Each interleaved execution of an atomic register is required to be linearizable [14], that is, it
must be equivalent to an execution in which the operations are executed sequentially, and the order of
non-overlapping operations is preserved. The main difference between a regular register (a register that satisfies
the property that every read returns the value written by the most recent write or by a concurrent write) and
an atomic register is the absence, for the latter, of new/old inversions. Consider two consecutive² reads
r_1, r_2 and two consecutive writes w_1, w_2 of a regular register such that r_1 is concurrent with both w_1 and
w_2, and r_2 is concurrent only with w_2. The regularity property allows r_2 to return the value written by w_1
and r_1 to return the value written by w_2. This phenomenon is called the new/old inversion.

2 Two operations op_1 and op_2 are consecutive if op_1 is the most recent operation that happens before op_2.


An atomic register prevents new/old inversions in all executions.

Formally, an atomic register satisfies the following two properties:

• Regularity property. A read operation returns either the value written by the most recent write
operation that happened before the read or a value written by a concurrent write.

• No new/old inversions. If a read operation r_1 reads a value from a concurrent write operation w_2, then
no read operation that happens after r_1 reads a value from a write operation w_1 that happens before
w_2.

Practically stabilizing SWMR atomic register. A message-passing system simulates a SWMR atomic
register in a practically stabilizing manner if any infinite execution starting in an arbitrary configuration, in which
the writer writes infinitely often, has a sub-execution with a practically infinite number of write operations
in which the atomicity requirement holds. A practically infinite execution is an execution of at least 2^k steps,
for some large k; for example, k = 64 is big enough for any practical system.
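
To give a feel for why k = 64 is "practically infinite", the following back-of-the-envelope sketch converts 2^k steps into years. The step rate of one step per nanosecond and the helper name `years_to_exhaust` are assumptions made only for this illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # Julian year, in seconds


def years_to_exhaust(k, steps_per_second):
    """Years needed to execute 2**k steps at the given (assumed) step rate."""
    return 2 ** k / steps_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    # Assumption: one step per nanosecond, i.e. 10**9 steps per second.
    print(years_to_exhaust(64, 1e9))
```

Under this assumed rate the result is several centuries, which is well beyond the life-span of any reasonable system.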

3 Overview of the Algorithm 

3.1 The Basic Quorum-Based Simulation 

We describe the basic simulation, which follows the quorum-based approach of [3], and ensures that our
algorithm tolerates (crash) failures of less than a majority of the processors. Our simulation assumes the
existence of an underlying stabilizing data-link protocol [13], similar to the ping-pong mechanism used
in [3].

The simulation relies on a set of read and write quorums, each being a majority of processors. The sim- 
ulation specifies the write and read procedures, in terms of QuorumRead and QuorumWrite operations. 
The QuorumRead procedure sends a request to every processor, for reading a certain local variable of the 
processor; the procedure terminates with the obtained values, after receiving answers from processors that 
form a quorum. Similarly, the QuorumWrite procedure sends a value to every processor to be written to a 
certain local variable of the processor; it terminates when acknowledgments from a quorum are received. If 
a processor that is inside QuorumRead or QuorumWrite keeps taking steps, then the procedure terminates
(possibly with arbitrary values). Furthermore, if a processor starts a QuorumRead procedure execution, then
the stabilizing data link [13] ensures that a read of a value returns a value held by the read variable some time
during its period; similarly, a QuorumWrite(v) procedure execution causes v to be written to the variable
during its period.

Each processor p_i maintains a variable, MaxSeq_i, which is meant to hold the "largest" sequence number
the processor has read; p_i maintains in v_i the value that p_i knows for the implemented register (which is
associated with MaxSeq_i).

The write procedure of a value v starts with a QuorumRead of the MaxSeq_i variables; upon receiving
answers l_1, l_2, ... from a quorum, the writer picks a sequence number l_m that is larger by one than the
maximum of MaxSeq_0 and l_1, l_2, ...; the writer assigns l_m to MaxSeq_0 and calls QuorumWrite with the
value (l_m, v). Whenever a quorum member p_i receives a QuorumWrite request (l, v) for which l is larger than
MaxSeq_i, p_i assigns l to MaxSeq_i and v to v_i.

The read procedure by p_i starts with a QuorumRead of both the MaxSeq_j and the (associated) v_j
variables. When p_i receives answers (l_1, v_1), (l_2, v_2), ... from a quorum, p_i finds the largest label l_m among
MaxSeq_i and l_1, l_2, ..., and then calls QuorumWrite with the value (l_m, v_m). This ensures that later read
operations will return this, or a later, value of the register. When QuorumWrite terminates, after a write
quorum acknowledges, p_i assigns l_m to MaxSeq_i and v_m to v_i and returns v_m as the value read from the
register.
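
The write and read procedures just described can be sketched in a few lines. This is a minimal single-threaded illustration with unbounded integer sequence numbers, not the self-stabilizing algorithm: the names `Replica`, `majority`, `write`, and `read` are hypothetical, a random majority stands in for "the first quorum to answer", and crashes, concurrency, and corrupted initial states are not modeled.

```python
import random


class Replica:
    """Local state of a processor p_i: MaxSeq_i and the associated value v_i."""

    def __init__(self):
        self.max_seq = 0
        self.value = None

    def on_quorum_write(self, seq, value):
        # Adopt the pair only if its sequence number is larger than MaxSeq_i.
        if seq > self.max_seq:
            self.max_seq, self.value = seq, value


def majority(replicas, rng):
    # Assumption: any majority may be the one that answers.
    return rng.sample(replicas, len(replicas) // 2 + 1)


def quorum_write(replicas, seq, value, rng):
    for r in majority(replicas, rng):
        r.on_quorum_write(seq, value)


def write(replicas, value, rng):
    writer = replicas[0]                         # p_0 is the writer
    answers = [r.max_seq for r in majority(replicas, rng)]
    seq = max([writer.max_seq] + answers) + 1    # one more than everything seen
    # Choice made in this sketch: the writer stores the value with MaxSeq_0.
    writer.max_seq, writer.value = seq, value
    quorum_write(replicas, seq, value, rng)


def read(replicas, i, rng):
    reader = replicas[i]
    answers = [(r.max_seq, r.value) for r in majority(replicas, rng)]
    answers.append((reader.max_seq, reader.value))
    seq, value = max(answers, key=lambda a: a[0])
    quorum_write(replicas, seq, value, rng)      # help later readers see it
    reader.max_seq, reader.value = seq, value
    return value
```

Since any two majorities intersect, a read that starts after a completed write always meets at least one replica holding that write's sequence number or a later one.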

Note that the QuorumRead operation, beginning the write procedure of p_0, helps to ensure that MaxSeq_0
holds the maximal value, as the writer reads the biggest accessible value (directly read by the writer, or
propagated to variables that are later read by the writer) in the system during any write.

Let g(C_i) be the number of distinct values greater than MaxSeq_0 that exist in some configuration C_i.
Since all the processors, except the writer, only copy values and since p_0 can only increment the value of
MaxSeq_0, it holds for every i ≥ 1 that

g(C_i) ≥ g(C_{i+1}).

Furthermore,

g(C_i) > g(C_{i+1}),

whenever the writer discovers (when executing step a_i) a value greater than MaxSeq_0. Roughly speaking,
the faster the writer discovers these values, the earlier the system stabilizes. If the writer does not discover 
such a value, then the (accessible) portion of the system in which its values are repeatedly written, performs 
reads and writes correctly. 

3.2 Epochs 

As described in the introduction, it is possible that the sequence numbers wrap around faster than planned, 
due to "corrupted" initial values. When the writer discovers that this has happened, it opens a new epoch, 
thereby invalidating all sequence numbers from previous epochs. 

Epochs are denoted with labels from a bounded domain, using a bounded labeling scheme. Such a 
scheme provides a function to compute a new label, which is "larger" than a given set of labels. 

Definition 1 A labeling scheme over a bounded domain ℒ provides an antisymmetric comparison predicate
≺_b on ℒ and a function Next_b(S) that returns a label in ℒ, given some subset S ⊆ ℒ of size at most m. It
is guaranteed that for every L ∈ S, L ≺_b Next_b(S).

Note that the labeling scheme [10] used in the original atomic memory simulation [3] does not cope
with transient failures. The next section describes a construction of a bounded labeling scheme that can cope 
with badly initialized labels, namely, that does not assume that labels were only generated by using Next. 

Using this scheme, it is guaranteed that if the writer eventually learns about all the epochs in the system, 
it will generate an epoch greater than all of them. After this point, any read that starts after a write of v is 
completed (written to a quorum) returns v (or a later value), since the writer will use increasing sequence 
numbers. 



The eventual convergence of the labeling scheme depends on invoking Next_b with a parameter S that is
a superset of the epochs that are in the system. Estimating this set is another challenge for the simulation.

We explain the intuition of this part of the simulation through the following two-player guessing game, 
between a finder, representing the writer, and a hider, representing an adversary controlling the system. 

- The hider maintains a set of labels H, whose size is at most m (a parameter that will be chosen later).

- The finder does not know H, but it would like to generate a label greater than all labels in H.

- The finder generates a label L and if H contains a label L', such that it does not hold that L' ≺_b L,
then the hider exposes L' to the finder.

- In this case, the hider may choose to add L to H; however, it must ensure that the size of H remains
at most m (by removing another label). (The finder is unaware of the hider's decision.)

- If the hider does not expose a new label L' from H, the finder wins this iteration and continues to use
L.

The finder uses the following strategy. It maintains a fifo queue of 2m labels, meant to track the most 
recent labels. The queue starts with arbitrary values, and during the course of the game, it holds up to m 
recent labels produced by the finder, that turned out to be overruled by existing labels (provided by the 
hider). The queue also holds up to m labels that were revealed to overrule these labels. 

Before the finder chooses a new label, it enqueues its previously chosen label and the label received 
from the hider in response. Enqueuing a label that appears in the queue pushes the label to the head of the 
queue; if the bound on the size of the queue is reached, then the oldest label in the queue is dequeued. This 
semantics of enqueue is used throughout the paper. 

The finder chooses the next label by applying Next_b, using as parameter the 2m labels in the queue.
Intuitively, the queue eventually contains a superset of H, and the finder generates a label greater than all
the current labels of the hider.
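
The enqueue semantics used by the finder (a repeated label moves to the head, and the oldest label is dropped when the bound is exceeded) can be sketched as follows. The class name `LabelQueue` and its fields are hypothetical, and labels are assumed to be hashable values.

```python
class LabelQueue:
    """Bounded queue with move-to-head enqueue; index 0 is the head."""

    def __init__(self, bound, initial=()):
        self.bound = bound
        # The queue may start with arbitrary content; duplicates are dropped
        # here only to keep the sketch simple.
        self.items = list(dict.fromkeys(initial))[:bound]

    def enqueue(self, label):
        if label in self.items:
            self.items.remove(label)   # already present: push it to the head
        self.items.insert(0, label)
        if len(self.items) > self.bound:
            self.items.pop()           # bound reached: dequeue the oldest label

    def labels(self):
        return list(self.items)
```

For the game, the bound would be 2m, and the finder would enqueue its previously chosen label together with the label exposed by the hider before choosing the next label.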

Lemma 1 All the labels of the hider are smaller than one of the first m + 1 labels chosen by the finder. 

Sketch of proof: A simple induction shows that when the finder chooses the i-th new label, i > 0, the 2i
items in the front of the queue consist of the first i labels generated by the finder, and the first i labels
revealed by the hider.

Note that a response cannot expose a label that has been introduced or previously exposed in the game
since the finder always chooses a label greater than all labels in the queue, in particular these 2i labels. Thus,
if the finder does not win when introducing the m-th label, all the m labels that the hider had when the game
started were exposed and therefore stored in the queue of the finder, together with all the recent m labels
introduced by the finder, before the (m+1)-st label is chosen. Therefore, the (m+1)-st label is larger than every
label held by the hider, and the finder wins. □

3.3 Timestamps 

The complete simulation tags each value written with a timestamp: a pair (l, i), where l is an epoch chosen
from a bounded domain ℒ and i is a sequence number (an integer smaller than some bound r).



4 A Bounded Labeling Scheme with Uninitialized Values 

Let k > 1 be an integer, and let K = k^2 + 1. We consider the set X = {1, 2, ..., K} and let ℒ (the set of
labels) be the set of all ordered pairs (s, A), where s ∈ X is called in the sequel the sting, and A ⊆ X
has size k and is called in the sequel the antistings. It follows that |ℒ| = \binom{K}{k} K = k^{(1+o(1))k}.

The comparison operator ≺_b among the bounded labels is defined to be:

(s_j, A_j) ≺_b (s_i, A_i) ≡ (s_j ∈ A_i) ∧ (s_i ∉ A_j)

Note that this operator is antisymmetric by definition, yet may not be defined for every pair (s_i, A_i) and
(s_j, A_j) in ℒ (e.g., s_j ∈ A_i and s_i ∈ A_j).

We define now a function to compute, given a subset S of at most k labels of ℒ, a new label which is
greater (with respect to ≺_b) than every label of S. This function, called Next_b (see Figure 1), is as follows.
Given a subset of k labels (s_1, A_1), (s_2, A_2), ..., (s_k, A_k), we construct a label (s_i, A_i) which satisfies:

- s_i is an element of X that is not in the union A_1 ∪ A_2 ∪ ... ∪ A_k (as the size of each A_j is k, the size
of the union is at most k^2, and since X is of size k^2 + 1 such an s_i always exists).

- A_i is a subset of size k of X containing all values (s_1, s_2, ..., s_k) (if they are not pairwise distinct,
add arbitrary elements of X to get a set of size exactly k).

pick function: For any ∅ ≠ S ⊆ X, pick(S) returns an arbitrary (later defined for particular cases) element of S.

Next_e(S):
1: if ∃(l_0, j_0) ∈ S such that ∀(l, j) ∈ S, (l, j) ≠ (l_0, j_0), (l, j) ≺_e (l_0, j_0) ∧ j_0 < r
2:   then return (l_0, j_0 + 1)
3:   else return (Next_b(S̃), 0)

Next_b(S):
1: A := {s_1} ∪ {s_2} ∪ ... ∪ {s_k}
2: while |A| ≠ k
3:   A := A ∪ {pick(X \ A)}
4: s := pick(X \ (A ∪ A_1 ∪ A_2 ∪ ... ∪ A_k))
5: return (s, A)

Figure 1: Next_b and Next_e. S̃ denotes the set of labels appearing in S.

Lemma 2 Given a subset S of k labels from ℒ, (s_i, A_i) = Next_b(S) satisfies:

∀(s_j, A_j) ∈ S, (s_j, A_j) ≺_b (s_i, A_i)

Proof Sketch: Let (s_j, A_j) be an element of S. By construction, s_j ∈ A_i and s_i ∉ A_j, and the result
follows from the definition of ≺_b. □
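
The construction can be sketched directly from the textual description. In this illustration the function names `prec_b` and `next_b` are hypothetical, labels are assumed to be pairs (s, A) with A a frozenset of size k over X = {1, ..., k^2 + 1}, and "arbitrary" choices are resolved by taking the smallest candidate. The sketch follows the two bullet conditions above and only requires the sting to lie outside the given antistings.

```python
def prec_b(a, b):
    """(s_j, A_j) precedes (s_i, A_i) iff s_j is in A_i and s_i is not in A_j."""
    (sj, Aj), (si, Ai) = a, b
    return sj in Ai and si not in Aj


def next_b(labels, k):
    """Return a label greater, w.r.t. prec_b, than each of at most k labels."""
    labels = list(labels)
    assert len(labels) <= k
    X = range(1, k * k + 2)                      # X = {1, ..., k^2 + 1}
    covered = set().union(*(A for _, A in labels))
    # Sting: outside every given antistings set; one exists since the union
    # has at most k^2 elements while X has k^2 + 1.
    s = next(x for x in X if x not in covered)
    # Antistings: all given stings, padded to size exactly k.
    A = {sj for sj, _ in labels}
    for x in X:
        if len(A) == k:
            break
        if x != s:
            A.add(x)
    return (s, frozenset(A))
```

Note that the input labels need not be comparable with one another; the returned label still dominates each of them, which is the property the simulation relies on after transient failures.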

Note also that it is simple to compute A_i and s_i given a set S with k labels, and can be done in time
linear in the total length of the labels given, i.e., in O(k^2) time. Since the number of labels |ℒ| is k^{(1+o(1))k},
we have that k is (1 + o(1)) log|ℒ| / log log|ℒ|.



Timestamps. A timestamp is a pair (l, i) where l is a bounded epoch, and i is an integer (sequence number),
ranging from 0 to a fixed bound r > 1.

The Next_e operator, which relies on comparing timestamps, is described in Figure 1. Note that in
line 3 of the code we use S̃ for the set of labels (with sequence numbers removed) that appear in S. The
comparison operator ≺_e for timestamps is:

(x, i) ≺_e (y, j) ≡ x ≺_b y ∨ (x = y ∧ i < j)

In the sequel, we use ≺_b to compare timestamps according to their epochs only.
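
The timestamp comparison and the Next_e operator can be sketched as below. The function names are hypothetical, and to keep the block self-contained the epoch order and the epoch generator are passed as parameters with integer stand-ins as defaults (`operator.lt` and "maximum plus one"); these defaults are assumptions for illustration only and are not the bounded labeling scheme. The check that the sequence number is below r follows the description of Next_e given in Section 5.

```python
import operator


def prec_e(t1, t2, prec_b=operator.lt):
    """(x, i) precedes (y, j) iff x precedes y, or x == y and i < j."""
    (x, i), (y, j) = t1, t2
    return prec_b(x, y) or (x == y and i < j)


def next_e(timestamps, r, prec_b=operator.lt,
           next_b=lambda labels: max(labels) + 1):
    """Increment the dominating timestamp if possible, else open a new epoch."""
    for t0 in timestamps:
        others = [t for t in timestamps if t != t0]
        if t0[1] < r and all(prec_e(t, t0, prec_b) for t in others):
            return (t0[0], t0[1] + 1)         # same epoch, next sequence number
    epochs = {l for l, _ in timestamps}       # sequence numbers removed
    return (next_b(epochs), 0)                # new epoch, sequence number 0
```

With the integer stand-ins, a set whose maximal timestamp has sequence number r yields a fresh epoch with sequence number 0, mirroring the wrap-around case.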



5 Putting the Pieces Together 



Each processor p_i holds, in MaxTS_i, two fields (ml_i, cl_i), where ml_i is the timestamp associated with the
last write of a value to the variable v_i and cl_i is a canceling timestamp, possibly empty (⊥), which is not
smaller than MaxTS_i.ml in the ≺_b order. The canceling field is used to let the writer (the finder in the game)
know of an evidence. A timestamp (l, i) is an evidence for timestamp (l', j) if and only if l ⊀_b l'. When the
writer faces an evidence it changes the current epoch.

The pseudo code for the read and write procedures appears in Figure 2. Note that in lines 2 and 9 of
the write procedure, a label is enqueued if and only if it is not equal to MaxTS_0. Note further that Next_e
in line 4 of the writer first tries to increment the sequence number of the label stored in MaxTS_0, and if
the sequence number already equals the upper bound r, then p_0 enqueues the value of MaxTS_0 and uses
the updated epochs queue to choose a new value for MaxTS_0, which is a new epoch Next_b(epochs) and
sequence number 0.



write(v)
1: ((ml_1, cl_1), v_1), ((ml_2, cl_2), v_2), ... := QuorumRead
2: ∀i, if ml_i ≠ MaxTS_0 then enqueue(epochs, ml_i)
3: ∀i, if cl_i ≠ MaxTS_0 then enqueue(epochs, cl_i)
4: if ∀l ∈ epochs, l ≺_e MaxTS_0 then
5:   MaxTS_0 := (Next_e(MaxTS_0, epochs), ⊥)
6: else
7:   enqueue(epochs, MaxTS_0)
8:   MaxTS_0 := ((Next_b(epochs), 0), ⊥)
9: QuorumWrite((MaxTS_0.ml, v))

Upon a request of QuorumWrite (l, v)
10: if l ≠ MaxTS_0 then enqueue(epochs, l)

read
1: ((ml_1, cl_1), v_1), ((ml_2, cl_2), v_2), ... := QuorumRead
2: if ∃m such that cl_m = ⊥ and
3:   (∀i ≠ m, ml_i ≺_e ml_m and cl_i ≺_e ml_m) then
4:   QuorumWrite(ml_m, v_m)
5:   return(v_m)
6: else return(⊥)

Upon a request of QuorumWrite (l, v)
7: if MaxTS_i.ml ≺_e l and MaxTS_i.cl ≺_e l then
8:   MaxTS_i := (l, ⊥)
9:   v_i := v
10: else if l ⊀_b MaxTS_i.ml and MaxTS_i.ml ≠ l then MaxTS_i.cl := l

Figure 2: write(v) and read.

The write procedure of a value v starts with a QuorumRead of the MaxTS_i variables, and upon
receiving answers l_1, l_2, ... from a quorum, the writer p_0 enqueues to the epochs queue the epochs of the
received ml and non-⊥ cl values, which are not equal to MaxTS_0 (lines 1-3). The writer then computes
MaxTS_0 to be the Next_e timestamp, namely if the epoch of MaxTS_0 is the largest in the epochs queue
and the sequence number of MaxTS_0 is less than r, then p_0 increments the sequence number of MaxTS_0 by
one, leaving the epoch of MaxTS_0 unchanged (lines 4-5). Otherwise, it is necessary to change the epoch:
p_0 enqueues MaxTS_0 to the epochs queue and applies Next_b to obtain an epoch greater than all the ones
in the epochs queue; it assigns to MaxTS_0 the timestamp made of this epoch and a zero sequence number
(lines 7-8). Finally, p_0 executes the QuorumWrite procedure with (MaxTS_0, v) (line 9).

Whenever the writer p_0 receives (as a quorum member) a QuorumWrite request containing an epoch
that is not equal to MaxTS_0, p_0 enqueues the received label in the epochs queue (line 10).

The read procedure executed by a reader p_i starts with a QuorumRead of the MaxTS_j and the (associated)
v_j variables (line 1). When p_i receives answers ((ml_1, cl_1), v_1), ((ml_2, cl_2), v_2), ... from a quorum,
p_i tries to find a maximal timestamp ml_m according to the ≺_e operator from among ml_i, cl_i, ml_1, cl_1,
ml_2, cl_2, .... If p_i finds such a maximal timestamp ml_m, then p_i executes the QuorumWrite procedure
with (ml_m, v_m). Once the QuorumWrite terminates (the members of a quorum acknowledged), p_i assigns
MaxTS_i := (ml_m, ⊥) and v_i := v_m, and returns v_m as the value read from the register (lines 2-5).
Otherwise, in case no such maximal value ml_m exists, the read is aborted (line 6).

When a quorum member p_i receives a QuorumWrite request (l, v), it checks whether both MaxTS_i.ml ≺_b
l and MaxTS_i.cl ≺_b l. If this is the case, then p_i assigns MaxTS_i := (l, ⊥) and v_i := v (lines 7-9).
Otherwise, p_i checks whether l ≠ MaxTS_i.ml and if so assigns MaxTS_i.cl := l (line 10). Note that ⊥ ≺_b l,
for any l.

Note that we assume the existence of an underlying data-link protocol that emulates FIFO links over a
non-FIFO communication environment. In the following we assume that the data-link protocol also helps
in repeatedly transmitting the value of MaxTS from one processor to another. In case the MaxTS_i.cl of a
processor p_i is ⊥ and p_i receives from a neighbor p_j a MaxTS_j such that MaxTS_j.ml ⊀_b MaxTS_i.ml,
then p_i assigns MaxTS_i.cl := MaxTS_j.ml; otherwise, when MaxTS_j.cl ≠ MaxTS_i.ml, p_i
assigns MaxTS_i.cl := MaxTS_j.cl. Note also that the writer will enqueue every diffused value different
from MaxTS_0. The code is identical to line 9 in the writer code.

5.1 Outline of Correctness Proof 

The correctness of the simulation is implied by the game and our previous observations, which we can now
summarize, recapping the arguments explained in the description of the individual components.

In the simulation, the finder/writer may introduce new epochs even when the hider does not introduce
any evidence. We consider a timestamp (l, i) to be evidence for a timestamp (l', j) if and only if l ≠ l'.
Using a large enough bound r on the sequence number (e.g., a 64-bit number), we ensure that either there is
a practically infinite execution in which the finder/writer introduces new timestamps with no epoch change,
and therefore with growing sequence numbers and a well-defined timestamp ordering, or a new epoch is
frequently introduced due to the exposure of hidden unknown epochs. The latter case follows the winning
strategy described for the game.
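
To give a sense of what "practically infinite" means for a 64-bit bound, the following back-of-the-envelope computation estimates how long it takes to exhaust the sequence numbers of a single epoch. The write rate of one write per nanosecond is a hypothetical, deliberately generous assumption, and the function name is ours.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_to_exhaust(bits: int = 64, writes_per_second: float = 1e9) -> float:
    """Years needed to use up all sequence numbers of one epoch at the given rate."""
    return (2 ** bits) / writes_per_second / SECONDS_PER_YEAR


print(round(years_to_exhaust()))  # several centuries, even at 10^9 writes per second
```

Even at this rate a single epoch lasts for centuries, which is why an execution with no epoch change can be treated as unbounded for practical purposes.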

The sequence numbers allow the writer to introduce many (practically infinitely many) timestamps without
storing all of them, as their epoch is identical. The sequence numbers are a simple extension of the bounded
epochs, acting as the least significant digit of a counter; this allows the queues to be proportional to the
bounded number of labels in the system. Thus, either the writer introduces an epoch greater than any one in
the system, and hence will use this epoch to essentially implement a register for a practically unbounded
period, or the readers never introduce some existing bigger epoch, letting the writer increment the sequence
number infinitely often. Note that if the game continues while the finder is aware of (a superset including)
all existing epochs, and introduces a greater epoch, there is a practically infinite execution before a new
epoch is introduced.

In the scope of simulating a SWMR atomic register, following the first write of a timestamp greater
than any other timestamp in the system, with sequence number 0, to a majority quorum, any read in a
practically infinite execution will return the last timestamp that has been written to a quorum. In particular,
if a reader finds a timestamp introduced by the writer that is larger than all other timestamps but not yet
completely written to a majority quorum, the reader assists in completing the write to a majority quorum
before returning the read value.

The memory may stop operating while the set of timestamps does not include a timestamp greater than the
rest. That is, read operations may be repeatedly aborted until the writer writes new timestamps. Moreover,
a slow reader may store a timestamp unknown to the rest (and in particular to the writer) and eventually
introduce the timestamp to the rest. In the first case, the convergence of the system is postponed until the
writer is aware of a superset of the existing timestamps. In the second case, the system operates correctly,
implementing read and write operations, until the timestamp unknown to the rest is introduced.

Theorem 1 The algorithm eventually reaches a period in which it simulates a SWMR atomic register, for a 
number of operations that is linear in r. 

Each read or write operation requires O(n) messages. The size of the messages is linear in the size of
a timestamp, namely the sum of the size of the epoch and log r. The size of an epoch is O(m log m), where
m is the size of the epochs queue, namely, O(cn^2), where c is the capacity of a communication link.

Note that the size of the epochs queue, and with it, the size of an epoch, is proportional to the number 
of labels that can be stored in a system configuration. Reducing the link capacity will reduce the number of 
labels that can be "hidden" in the communication links. This can be achieved by using a stabilizing data-link 
protocol [13], in a manner similar to the ping-pong mechanism used in 01.

6 Conclusion 

We have presented a self-stabilizing simulation of a single-writer multi-reader atomic register, in an
asynchronous message-passing system in which fewer than half the processors may crash.

Given our simulation, it is possible to realize self-stabilizing replicated state machines lPT2l. The
self-stabilizing consensus algorithms presented in [7] use SWMR registers, and our simulation allows
porting them to message-passing systems. More generally, our simulation allows any self-stabilizing
algorithm that is designed using SWMR registers to work in a message-passing system in which
fewer than half the processors may crash.

Our work leaves open many interesting directions for future research. The most interesting one is to find
a stabilizing simulation that operates correctly even after sequence numbers wrap around, without an
additional convergence period. This seems to mandate a more careful way to track epochs [[]], perhaps by
incorporating a self-stabilizing analogue of the viability construction lf3l.

Acknowledgments. We thank Ronen Kat and Eli Gafni for helpful discussions. 
Annexes

Lemma 3 Every execution has an infinite suffix where every hidden timestamp is eventually revealed to the
writer or stays hidden forever (revealed neither to the writer nor to a reader).

Proof Sketch: Consider an execution where a timestamp is not revealed directly to the writer but to some
clean reader (a reader whose canceling field is set to ⊥). The other cases are trivial. Let l be the timestamp
and i be the reader. By the description of the code piggybacked on the data-link, i compares
MaxTS_i.ml with l. If l ⊀_e MaxTS_i.ml, then MaxTS_i.cl is set to l. Then, either the writer contacts the
reader via a QuorumRead and gets the canceling field, or the reader is contacted by another clean reader and
the canceling is propagated. Eventually, the writer gets the canceled timestamp and enqueues it. □

Lemma 4 Each infinite execution has an infinite suffix where every QuorumRead invocation by a reader 
returns a maximum clean timestamp. 

Proof Sketch: We prove that the prefix in which QuorumRead invocations by a reader return
either canceled timestamps or timestamps that do not have a clean maximum is finite. The proof is by
construction. Every write operation invokes a QuorumWrite with a clean timestamp that is greater than any
timestamp the writer is aware of. Therefore, every QuorumRead invoked after the QuorumWrite invocation
captures this value. According to Lemma 3, every hidden timestamp is eventually either revealed to the
writer and enqueued, or stays hidden. Since the number of hidden values is bounded, the writer enqueues
these values in finite time. Consider the execution after the writer enqueues the last hidden value. The
next write operation produces a timestamp that is greater than any timestamp that will ever be revealed in
the execution, and any QuorumRead invoked after the execution of this write will get this timestamp. □

Lemma 5 Each execution of the system has an infinite suffix where reads do not abort. 

Proof Sketch: According to Lemma 4, every execution has an infinite suffix where each QuorumRead
invocation returns a maximum clean timestamp. It follows that for every read invocation, the conditions in lines
2 and 3 (reader's code) are satisfied and the value returned by the read is not ⊥. □

Lemma 6 Any execution of the system has an infinite suffix that satisfies the regularity property. 

Proof Sketch: Let e be an infinite execution of the system. By Lemma 5 and Lemma 3, e contains
an infinite suffix, e', where every read returns a non-abort value and every write includes in its decision set all
the labels in the system. Assume there is a process p whose read invocations always return an obsolete
value. That is, the value returned by the read is either a hidden value or a value corresponding to a previous
write but not the most recent one. Let r be such a read. In e', r returns the output value with the maximum
timestamp over the set of labels returned by QuorumRead. Let w_1 and w_2 be two write operations such that
w_1 happens before w_2 and r. Since w_1 happens before r, the label computed by w_1 is written to at
least a majority of processes via a QuorumWrite and is greater than any label in the system. When r starts
invoking QuorumRead, two cases may arise: (1) w_2 has not modified the value written by w_1 and has not started
its promotion via QuorumWrite, or (2) w_2 is executing QuorumWrite but has not finished. In the first
case, w_1's MaxTS is the largest in the system. When r invokes the QuorumRead, it gets w_1's MaxTS value
(otherwise w_1 has not terminated) and returns it. Hence, r cannot return a value older than the one written
by w_1. In the second case, some processes contacted in the QuorumRead may send w_1's MaxTS, and other
processes w_2's MaxTS. Since the MaxTS computation at the writer is sequential, w_2's MaxTS is
greater than w_1's MaxTS. Then, following lines 2 and 3 in the reader code, r should return w_2's MaxTS.
Hence, r will return the last written value. □

Lemma 7 Any execution of the system has an infinite suffix that satisfies the no new/old inversion property. 

Proof Sketch: Let e be an execution of the system. By Lemmas 5 and 6, e has an infinite suffix, e',
that satisfies the regularity property and in which no read invocation returns abort. In the following
we prove that e' does not violate the no new/old inversion property. Consider two write operations w_1 and w_2 in
e' such that w_1 happens before w_2. Consider also two read operations r_1 and r_2 such that r_1 happens before
r_2 and w_1 happens before r_1.¹ Assume r_1 and r_2 are concurrent with w_2. Assume a new/old inversion
happens and r_1 returns the value written by w_2. Denote the MaxTS of this value by l_2. Assume also that r_2
returns the value written by w_1, whose MaxTS is l_1. Since r_1 happens before r_2, before the start of r_2, r_1
executes the following actions: it modifies its MaxTS to l_2, and it executes QuorumWrite in order to inform
the system of its new value. Since QuorumWrite returns before r_1 finishes, l_2 is already adopted by
at least a majority of processes. That is, since l_2 ≻_e l_1 (w_1 happens before w_2), l_2 replaces l_1 in at least
a majority of processes.

We assumed r_2 returns l_1. Since r_1 happens before r_2, r_2 starts its QuorumRead after r_1 returned, and
hence after r_1 completed its QuorumWrite operation. This implies that l_2 is the label adopted by at least a
majority of processes, and at least one process in this majority will respond when r_2 invokes its QuorumRead.
That is, r_2 collects at least one label l_2, and since l_2 ≻_e l_1, r_2 should return this value. This contradicts
the assumption that r_2 returns l_1. It follows that e' satisfies the no new/old inversion property. □
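
The step "at least one process in this majority will respond" rests on the fact that any two majority quorums intersect. The following brute-force check illustrates this for small systems; it is only a sanity check with names of our choosing, not part of the simulation. It suffices to test minimal majorities, since every larger majority contains one.

```python
from itertools import combinations
from typing import FrozenSet, List


def minimal_majorities(n: int) -> List[FrozenSet[int]]:
    """All sets of exactly floor(n/2) + 1 processes out of n."""
    size = n // 2 + 1
    return [frozenset(q) for q in combinations(range(n), size)]


def all_majorities_intersect(n: int) -> bool:
    """True if every pair of majority quorums shares at least one process."""
    quorums = minimal_majorities(n)
    return all(a & b for a in quorums for b in quorums)
```

By contrast, sets containing exactly half of the processes need not intersect (for n = 4, {0, 1} and {2, 3} are disjoint), which is why a strict majority must remain alive.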



¹ By transitivity of the happens-before relation, w_1 also happens before r_2.






